1 Introduction to Cardiovascular Diseases (CVDs)

1.1 Basic Information About Cardiovascular Diseases (CVDs)

Cardiovascular Disease is a general name for numerous types of diseases including heart attack, stroke, and heart failure. In 2019 an estimated 17.9 million people died from Cardiovascular Diseases and 85% were due to heart attack and stroke.

1.2 Reasons of Choosing Cardiovascular Disease.

As living standards are improving over the past few decades, having a healthy lifestyle and avoiding serious illnesses have become people’s priorities and Cardiovascular diseases are the leading cause of death globally our group is interested in finding out some facts about Cardiovascular Disease such as what exactly Cardiovascular Diseases are. What are some factors that are correlate to Cardiovascular Diseases, among those factors what have the strongest correlation to Cardiovascular Diseases? And would a healthy lifestyle help to prevent people from having cardiovascular disease? Therefore, our group choose this dataset about Cardiovascular Diseases

2 Summary of the Dataset

2.1 About the Dataset

The Cardiovascular Diseases dataset is extracted from Kaggle. It contains information from 70,000 records of patients that provided their age, height, weight, physical condition, and habits. Studying such data can help the public and health organizations identify potential factors that could lead to a higher chance of getting Cardiovascular Diseases, and develop appropriate strategies to reduce the number of patients. The dataset is cardio_train.csv. Before doing any analysis on this dataset, we did some research about what every single variable in the dataset represents, why they are included in this dataset, and what variables we will use in our model. With all those information government agents and the public will know how to adjust their habits or lifestyle to prevent getting Cardiovascular Diseases. Therefore, it is crucial to figure out current problems, and then publish the information.

2.2 Variables

CVD <- cardio_vascular_disease
CVD$gender <- factor(cardio_vascular_disease$gender)
CVD$smoke <- factor(cardio_vascular_disease$smoke)
CVD$alco <- factor(cardio_vascular_disease$alco)
CVD$active <- factor(cardio_vascular_disease$active)
CVD$cardio <- factor(cardio_vascular_disease$cardio)
CVD$cholesterol <- factor(cardio_vascular_disease$cholesterol)
CVD$gluc <- factor(cardio_vascular_disease$gluc)
CVD$weight <- as.integer(cardio_vascular_disease$weight)
CVD$age <- (cardio_vascular_disease$age/365)
levels(CVD$cardio) <- c("No", "Yes")
levels(CVD$gender) <- c("Female", "Male")
levels(CVD$cholesterol) <- c("Normal", "Above Normal", "Well Above Normal")
levels(CVD$gluc) <- c("Normal", "Above Normal", "Well Above Normal")
levels(CVD$smoke) <- c("No", "Yes")
levels(CVD$alco) <- c("No", "Yes")
levels(CVD$active) <- c("No", "Yes")
CVD <- na.omit(CVD)
str(CVD)
## 'data.frame':    70000 obs. of  13 variables:
##  $ id         : int  0 1 2 3 4 8 9 12 13 14 ...
##  $ age        : num  50.4 55.4 51.7 48.3 47.9 ...
##  $ gender     : Factor w/ 2 levels "Female","Male": 2 1 1 2 1 1 1 2 1 1 ...
##  $ height     : int  168 156 165 169 156 151 157 178 158 164 ...
##  $ weight     : int  62 85 64 82 56 67 93 95 71 68 ...
##  $ ap_hi      : int  110 140 130 150 100 120 130 130 110 110 ...
##  $ ap_lo      : int  80 90 70 100 60 80 80 90 70 60 ...
##  $ cholesterol: Factor w/ 3 levels "Normal","Above Normal",..: 1 3 3 1 1 2 3 3 1 1 ...
##  $ gluc       : Factor w/ 3 levels "Normal","Above Normal",..: 1 1 1 1 1 2 1 3 1 1 ...
##  $ smoke      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ alco       : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ active     : Factor w/ 2 levels "No","Yes": 2 2 1 2 1 1 2 2 2 1 ...
##  $ cardio     : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 2 1 1 ...

The variables are the following:

1.id: Respondents’id | id | int

2.Age: Respondents’ age | age | int (years)

3.Height: Respondents’ height | height | int (cm)

4.Weight: Respondents’ weight | weight | num (kg)

5.Gender: Respondents’gender | gender | int (1:Feale|2:Male)

6.Systolic blood pressure: Respondents’ systolic blood pressure | ap_hi | int

7.Diastolic blood pressure: Respondents’ diastolic blood pressure | ap_lo | int

8.Cholesterol: Respondents’ cholesterol level | cholesterol | int (1: normal, 2: above normal, 3: well above normal)

9.Glucose: Respondents’ glucose level | gluc | int (1: normal, 2: above normal, 3: well above normal)

10.Smoking: Whether respondent smokes or not | smoke | int (0:Do not smoke, 1:Smoke)

11.Alcohol intake: Whether respondent drink or not | alco | int (0:Do not drink, 1: Drink)

12.Physical activity: Respondents’ do not physical activity regularly | active | int (0:Workout regularly, 1: No workout regularly)

13.Presence or absence of cardiovascular disease: Respondent present cardiovascular disease | cardio | int (0:No CVD currently present, 1:CVD currently present)

For our data analysis, we converted gender, cholesterol, glucose, smoke, alco, active and cardio into factor variables. Then we converted weight into integer and converted age’s unit into year. Furthermore, we removed all the NA in the data set.

3 Descriptive Statistics

3.1 Categorical Variables EDA

3.1.1 Are there any common in CVD patients?

We choose several factors from the data set and check if there are some factors that have a significant difference between CVD presented and CVD not presented.All the tables below will use percentage.

We start with smoke history and whether CVD are currently presented or not.

library(dplyr)
smoke_con<-aggregate(CVD$cardio, by=list(CVD$smoke,CVD$cardio), FUN=length)
smoke_con<-rename(smoke_con,Smoke = Group.1,CVD_presence = Group.2,CVD = x)
smoke_con$Smoke[smoke_con$Smoke=="0"] <- 'Not smoke'
smoke_con$Smoke[smoke_con$Smoke=="1"] <- 'Smoke'
smoke_con$CVD_presence[smoke_con$CVD_presence=="0"] <- 'Not Present'
smoke_con$CVD_presence[smoke_con$CVD_presence=="1"] <- 'Present'
xkabledply(smoke_con)
Model: Smoke ~ CVD_presence + CVD
Smoke CVD_presence CVD
No No 31781
Yes No 3240
No Yes 32050
Yes Yes 2929

(alcohol intake).

alco_con<-aggregate(CVD$cardio,by=list(CVD$alco,CVD$cardio), FUN=length)
alco_con<-rename(alco_con,alco = Group.1,CVD_presence = Group.2,CVD = x)
alco_con$alco[alco_con$alco=="0"] <- 'Not drink'
alco_con$alco[alco_con$alco=="1"] <- 'Drink'
alco_con$CVD_presence[alco_con$CVD_presence=="0"] <- 'Not Present'
alco_con$CVD_presence[alco_con$CVD_presence=="1"] <- 'Present'
xkabledply(alco_con)
Model: alco ~ CVD_presence + CVD
alco CVD_presence CVD
No No 33080
Yes No 1941
No Yes 33156
Yes Yes 1823

(Physical Activity).

active_con<-aggregate(CVD$cardio, by=list(CVD$active,CVD$cardio), FUN=length)
active_con<-rename(active_con,active = Group.1,CVD_presence = Group.2,CVD = x)
active_con$active[active_con$active=="0"] <- 'Physical Activity'
active_con$active[active_con$active=="1"] <- 'Not Physical Activity'
active_con$CVD_presence[active_con$CVD_presence=="0"] <- 'Not Present'
active_con$CVD_presence[active_con$CVD_presence=="1"] <- 'Present'
xkabledply(active_con)
Model: active ~ CVD_presence + CVD
active CVD_presence CVD
No No 6378
Yes No 28643
No Yes 7361
Yes Yes 27618

3.1.2 Plot comparing different factors

CVD-present vs CVD-not present (smoke)

smoke_pic <- data.frame(table(CVD$smoke,CVD$cardio))
names(smoke_pic) <- c("Smoke","Cardio","Count")
ggplot(data=smoke_pic, aes(x=Smoke, y=Count/sum(Count)*100, fill=Cardio)) + geom_bar(stat="identity")+scale_x_discrete(labels=c("0" = "Not smoke", "1" = "Smoke"))+scale_fill_discrete(labels=c("0" = "No cvd presented", "1" = "CVD presented"))+ggtitle("Percentage of Smoke history and present of CVDs") + ylab("Percentage")

CVD-present vs CVD-not present (alcohol)

alco_pic <- data.frame(table(CVD$alco,CVD$cardio))
names(alco_pic) <- c("alco","Cardio","Count")
ggplot(data=alco_pic, aes(x=alco, y=Count/sum(Count)*100, fill=Cardio)) + geom_bar(stat="identity")+scale_x_discrete(labels=c("0" = "Not Drink", "1" = "Drink"))+scale_fill_discrete(labels=c("0" = "No cvd presented", "1" = "CVD presented"))+ggtitle("Percentage of alcohol intake and present of CVDs") + ylab("Percentage")

CVD-present vs CVD-not present (physical activity)

active_pic <- data.frame(table(CVD$active,CVD$cardio))
names(active_pic) <- c("active","Cardio","Count")
ggplot(data=active_pic, aes(x=active, y=Count/sum(Count)*100, fill=Cardio)) + geom_bar(stat="identity")+scale_x_discrete(labels=c("0" = "Workout regularly", "1" = "Not workout regularly"))+scale_fill_discrete(labels=c("0" = "No cvd presented", "1" = "CVD presented"))+ggtitle("Percentage of workout and present of CVDs") + ylab("Percentage")

3.1.3 Chi-Squared Tests

We used the Chi- squared test to testify if categorical variables are related to cardio vascular disease. We set alpha as 0.1.

Test Smoking

Are they independent ?
- \(H_0\): CVD and smoking are independent.
- \(H_1\): They are not independent.

sctest<- chisq.test(smoke_cvd)
sctest
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  smoke_cvd
## X-squared = 17, df = 1, p-value = 4e-05

Since p-value = 4e-05 < 0.1, we reject the null hypothesis, so smoke has impact on cardio.

Test Alcohol

Are they independent ?
- \(H_0\): CVD and drinking alcohol are independent.
- \(H_1\): They are not independent.

alctest<- chisq.test(al_cvd)
alctest
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  al_cvd
## X-squared = 4, df = 1, p-value = 0.05

Since p-value = 0.05 < 0.1, we reject the null hypothesis, so alco has impact on cardio. However, if we set the alpha = 0.05, we would failed to reject the null hypothesis.

Test Physical Activity

Are they independent ?
- \(H_0\): CVD and physical activity are independent.
- \(H_1\): They are not independent.

act_test<- chisq.test(act_cvd)
act_test
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  act_cvd
## X-squared = 89, df = 1, p-value <2e-16

Since p-value = 2e-16 < 0.1, we reject the null hypothesis, so active has an impact on cardio.

3.2 Continuous Variables EDA to Cardiovascular Diseases (CVDs)

After a brief overview of the data set, it tells that there are 3 continuous variables, age, height and weight.

3.2.1 KDE plot of continuous variables

KDE plot represents the data using a continuous probability density curve in one or more dimensions. We can easily observe the distribution of samples with KDE plot.

CVD_new<- CVD
levels(CVD_new$cardio) <- c("No CVD Present", "CVD Present") 
CVD_new$cardio[CVD_new$cardio=="0"] <- 'No CVD Present'
CVD_new$cardio[CVD_new$cardio=="1"] <- 'CVD Present'
cols<-c("No CVD Present"="green","CVD Present"="blue")
str(CVD_new)
age_kdeplot <- ggplot(data = CVD_new, aes(x = age, color =cardio)) + 
            geom_density(aes(fill = cardio), alpha = 0.8) + 
             scale_fill_manual(values =cols) +
              labs(title="KDEplot for age") +
               labs(x="age", y="density") +
                theme(legend.position="top")
age_kdeplot

height_kdeplot <- ggplot(data = CVD_new, aes(x = height, color = cardio)) + 
            geom_density(aes(fill = cardio), alpha = 0.8) + 
             scale_fill_manual(values =cols) +
              labs(title="KDEplot for height") +
               labs(x="height", y="density") +
                theme(legend.position="top")
height_kdeplot

weight_kdeplot <- ggplot(data = CVD_new, aes(x = weight, color = cardio)) + 
            geom_density(aes(fill = cardio), alpha = 0.8) + 
             scale_fill_manual(values= cols) +
              labs(title="KDEplot for weight") +
               labs(x="weight", y="density") +
                theme(legend.position="top")
weight_kdeplot

As we can see from KDE plot for age, people with higher age are more likely to have CVD. And from KDE plot for height and weight , we can find that people with CVD and without CVD have very similar distributions. From these 3 KDE plots, which means,CVD may be positive correlated with age. Finally, weight may only make a little attribution to people presenting CVD.

3.2.2 Logistic regression for continuous variables

The logit model is often used for classification and predictive analytics. Logistic regression estimates the probability of an event occurring, such as voted or didn’t vote, based on a given data set of independent variables. Since the outcome is a probability, the dependent variable is bounded between 0 and 1(binary outcome). In this case, cardio is the binary outcome; as s result, we choose logistic regression to predict.

lm1 <- glm(cardio~age, family = binomial(link = "logit"), data = CVD)
lm2 <- glm(cardio~age + weight, family = binomial(link = "logit"), data = CVD)
lm3 <- glm(cardio~age+ height + weight, family = binomial(link = "logit"), data = CVD)

This is the summary of model 1.

xkabledply(lm1, title="summary of model 1")
summary of model 1
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.9280 0.0642 -61.1 0
age 0.0736 0.0012 61.7 0

This is the summary of model 2.

xkabledply(lm2, title="summary of model 2")
summary of model 2
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.8346 0.0791 -73.8 0
age 0.0730 0.0012 60.2 0
weight 0.0262 0.0006 45.1 0

This is the summary of model 3.

xkabledply(lm3, title="summary of model 3")
summary of model 3
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.8421 0.1788 -21.5 0
age 0.0716 0.0012 58.8 0
height -0.0127 0.0010 -12.3 0
weight 0.0285 0.0006 46.4 0
anovat <- anova(lm1,lm2,lm3, test="LRT")
anovat
## Analysis of Deviance Table
## 
## Model 1: cardio ~ age
## Model 2: cardio ~ age + weight
## Model 3: cardio ~ age + height + weight
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
## 1     69998      92985                         
## 2     69997      90810  1     2175   <2e-16 ***
## 3     69996      90656  1      154   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As it is shown above, with small p-values, model 3 is the best.

3.2.3 AUC and ROC Curve

We can use AUC and ROC to measure model 2 and model 3. AUC (Area Under The Curve) - ROC (Receiver Operating Characteristics) curve is a performance measurement for the classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability. It tells how much the model is capable of distinguishing between classes. Higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1. For summarize, the Higher the AUC, the better the model is.

For model 1:

prob <- predict(lm1,CVD, type = c("response"))
CVD$prob <- prob
library(pROC)
g<- roc(cardio~prob, data = CVD)
plot(g, main = "ROC curve of model 1")

auc(CVD$cardio, prob)
## Area under the curve: 0.635

The AUC value is 0.635 < 0.8. We think this is not a good fit.

For model 2:

prob <- predict(lm2,CVD, type = c("response"))
CVD$prob <- prob
library(pROC)
g<- roc(cardio~prob, data = CVD)
plot(g, main = "ROC curve of model 2")

auc(CVD$cardio, prob)
## Area under the curve: 0.668

The AUC value is 0.668. As 0.668 is less than 0.8, this model is not good enough.

For model 3:

prob1 <- predict(lm3,CVD, type = c("response"))
CVD$prob1 <- prob1
library(pROC)
g<- roc(cardio~prob1, data = CVD)
plot(g, main = "ROC curve of model 3")

auc(CVD$cardio, prob1)
## Area under the curve: 0.671

The AUC value is 0.671. As 0.671 is less than 0.8 and 0.671 is more than 0.668, this model is not good enough but is better than model 2.

In the 3 continuous variables, cardio has positive correlation with age and weight ; it also has negative correlation withheight . It means when a person with higher weight and shorter height, he has more probabilities to have Cardio Vascular Disease. And age is not significant influence factor to people cardio rates in 3 continuous variables.

4 EDA on the joint effect of influencing factors

From the results of above variables, we could find that some of the above variables have a significant effect on CVD. Next, we choose to graph the groups of variables with relatively large correlation coefficients one by one to explore how they affect the Cardio Vascular Disease.

4.1 Simple correlations

As most of variables are factors, we check their Spearman correlations.

First, we convert all the categorical variables to numeric.

CVDNum = CVD
for(i in 1:13){
  # age, height, weight
  if (!(i %in% c(2, 4, 5))){
    CVDNum[,i] = as.numeric(CVDNum[,i])
              }
}
CVDNum$prob1 <- NULL
CVDNum$prob <- NULL
str(CVDNum)
## 'data.frame':    70000 obs. of  13 variables:
##  $ id         : num  0 1 2 3 4 8 9 12 13 14 ...
##  $ age        : num  50.4 55.4 51.7 48.3 47.9 ...
##  $ gender     : num  2 1 1 2 1 1 1 2 1 1 ...
##  $ height     : int  168 156 165 169 156 151 157 178 158 164 ...
##  $ weight     : int  62 85 64 82 56 67 93 95 71 68 ...
##  $ ap_hi      : num  110 140 130 150 100 120 130 130 110 110 ...
##  $ ap_lo      : num  80 90 70 100 60 80 80 90 70 60 ...
##  $ cholesterol: num  1 3 3 1 1 2 3 3 1 1 ...
##  $ gluc       : num  1 1 1 1 1 2 1 3 1 1 ...
##  $ smoke      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ alco       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ active     : num  2 2 1 2 1 1 2 2 2 1 ...
##  $ cardio     : num  1 2 2 2 1 1 1 2 1 1 ...
CVDcor <- cor(subset(CVDNum, select=-id), method="spearman")

loadPkg("corrplot")

corrplot(CVDcor, type="lower", addCoef.col="black", number.cex=0.5, tl.cex=0.7,title="Spearman Correlation for CVD", mar=c(0,0,1,0))

Larger circle means higher correlation. We can see that CVD has negative correlation with cholesterol and height, which means that person who has shorter height or has high cholesterol is less likely to have Cardio Vascular Disease. Person who smoke and has high level cholesterol is also more likely to have CVD. So it makes sense that cholesterol and smoke have positive correlation, which means most person who smoke also have high level cholesterol .

4.2 Will Age influence the CVD rate with other variables?

a1 <- ggplot(CVD_factor,aes(x = smoke , y = age , fill = cardio)) +
  geom_violin(alpha = 0.5, aes(linetype=NA)) +
  xlab("Smoke") + ylab("Age")

a2 <- ggplot(CVD_factor,aes(x = alco , y = age , fill = cardio)) +
  geom_violin(alpha = 0.5, aes(linetype=NA)) +
  xlab("Alco") + ylab("Age")

a3 <- ggplot(CVD_factor, aes(x=height, y=age, color= cardio)) +
  geom_point(size=.5,alpha=0.4)
a3/ (a1+a2) + plot_annotation(title = 'Age Plot')

First, Age and height show no correlation.

Second, for Smoke, the sample groups of smoking and older age have higher CVD rates.

Finally, for cholesterol, people are also prone to CVD when drinking and older age.

4.3 Will Height influence the CVD rate with other variables?

h1 <- ggplot(CVD_factor,aes(x = active , y = height , fill = cardio)) +
  geom_violin(alpha = 0.5, aes(linetype=NA)) +
  xlab("Active") + ylab("Height")

h2 <- ggplot(CVD_factor, aes(x=weight, y=height, color= cardio)) +
  geom_point(size=0.5,alpha=0.4)
h1/ h2 + plot_annotation(title = 'Height Plot')

First, for Active, the sample groups of having physical activity have little higher CVD rates.

Second, weight and height show a little positive correlation.

4.4 Will Weight influence the CVD rate with other variables?

w1 <- ggplot(CVD_factor,aes(x = active , y = weight , fill = cardio)) +
  geom_violin(alpha = 0.5, aes(linetype=NA)) +
  xlab("Active") + ylab("Weight")

w2 <- ggplot(CVD_factor,aes(x = alco , y = weight , fill = cardio)) +
  geom_violin(alpha = 0.5, aes(linetype=NA)) +
  xlab("Alco") + ylab("Weight")
w1/ w2 + plot_annotation(title = 'Weight Plot')

First, for Active, the sample groups of having activity have little higher CVD rates.

Second, for Alcohol, the sample groups of no alcohol (“0”) have little higher CVD rates.

5 Models Comparison

5.1 Logistic Regression

The first model is logistic regression. We take cardio as dependent variables and we set alpha equal 0.05. So, rest of the 11 variables are going to be the independent variables and we are going to delete the variables that have p-value > 0.05 in the model summary.

df <- CVD
df$id <- NULL
df$prob1 <- NULL
df$prob <- NULL

levels(df$cardio) <- c("No", "Yes")
levels(df$gender) <- c("Female", "Male")
levels(df$cholesterol) <- c("Normal", "Above Normal", "Well Above Normal")
levels(df$gluc) <- c("Normal", "Above Normal", "Well Above Normal")
levels(df$smoke) <- c("No", "Yes")
levels(df$alco) <- c("No", "Yes")
levels(df$active) <- c("No", "Yes")
df <- na.omit(df)

str(df)

First model covering all variables:

logistic_all <- glm(cardio ~ age + gender + height + weight + ap_hi + ap_lo + cholesterol + gluc + smoke + alco + active, data = df, family = "binomial")
summary(logistic_all)
xkabledply(logistic_all, title = "Summary of Logistic Model")
Summary of Logistic Model
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.0841 0.2213 -36.534 0.0000
age 0.0542 0.0013 41.735 0.0000
genderMale 0.0143 0.0211 0.678 0.4974
height -0.0056 0.0012 -4.568 0.0000
weight 0.0152 0.0007 23.024 0.0000
ap_hi 0.0395 0.0006 65.235 0.0000
ap_lo 0.0003 0.0001 4.460 0.0000
cholesterolAbove Normal 0.4222 0.0259 16.286 0.0000
cholesterolWell Above Normal 1.1341 0.0344 32.928 0.0000
glucAbove Normal 0.0301 0.0344 0.876 0.3811
glucWell Above Normal -0.3388 0.0381 -8.894 0.0000
smokeYes -0.1314 0.0332 -3.957 0.0001
alcoYes -0.1695 0.0403 -4.210 0.0000
activeYes -0.2101 0.0210 -9.981 0.0000

With p-value = 0.4974, we delete gender variable.

Second Model:

logistic_1 <- glm(cardio ~ age + height + weight + ap_hi + ap_lo + cholesterol + gluc + smoke + alco + active, data = df, family = "binomial")
summary(logistic_1)
xkabledply(logistic_1, title = "Summary of Logistic Model")
Summary of Logistic Model
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.1439 0.2030 -40.121 0.0000
age 0.0542 0.0013 41.760 0.0000
height -0.0053 0.0011 -4.761 0.0000
weight 0.0152 0.0007 23.024 0.0000
ap_hi 0.0395 0.0006 65.330 0.0000
ap_lo 0.0003 0.0001 4.464 0.0000
cholesterolAbove Normal 0.4218 0.0259 16.274 0.0000
cholesterolWell Above Normal 1.1337 0.0344 32.922 0.0000
glucAbove Normal 0.0301 0.0344 0.874 0.3821
glucWell Above Normal -0.3389 0.0381 -8.899 0.0000
smokeYes -0.1256 0.0321 -3.914 0.0001
alcoYes -0.1681 0.0402 -4.180 0.0000
activeYes -0.2101 0.0210 -9.980 0.0000

With p-value = 0.3821, we delete glucose variable.

Third Model:

logistic_2 <- glm(cardio ~ age + height + weight + ap_hi + ap_lo + cholesterol + smoke + alco + active, data = df, family = "binomial")
summary(logistic_2)
xkabledply(logistic_2, title = "Summary of Logistic Model")
Summary of Logistic Model
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.1242 0.2028 -40.06 0e+00
age 0.0539 0.0013 41.55 0e+00
height -0.0054 0.0011 -4.86 0e+00
weight 0.0152 0.0007 23.02 0e+00
ap_hi 0.0396 0.0006 65.49 0e+00
ap_lo 0.0003 0.0001 4.48 0e+00
cholesterolAbove Normal 0.4234 0.0250 16.96 0e+00
cholesterolWell Above Normal 0.9855 0.0296 33.27 0e+00
smokeYes -0.1222 0.0320 -3.81 1e-04
alcoYes -0.1641 0.0401 -4.09 0e+00
activeYes -0.2085 0.0210 -9.91 0e+00

With all p-value being small, we accept this model as our final model.
Based on the result, we can say:
age, weight, systolic blood pressure, diastolic blood pressure and cholesterol have positive impact on the log (odds-ratio) of having cardio vascular disease.
height, smoking, alcohol and physical activity have negative impact on the log (odds-ratio) of having cardio vascular disease.

5.1.1 Coefficient Exponential

expcoef = exp(coef(logistic_2))
summary(expcoef)
xkabledply( as.table(expcoef), title = "Exponential of coefficients in Cardio")
Exponential of coefficients in Cardio
x
(Intercept) 0.0003
age 1.0553
height 0.9947
weight 1.0153
ap_hi 1.0404
ap_lo 1.0003
cholesterolAbove Normal 1.5272
cholesterolWell Above Normal 2.6792
smokeYes 0.8850
alcoYes 0.8487
activeYes 0.8118

For exponential of coefficients:
Every year increase in age will increase the odds-ratio having CVD present by a factor of 1.0553.
Every centimeter increase in hight will decrease the odds-ratio having CVD present by a factor of 0.9947.
Every kilogram increase in weight will increase the odds-ratio having CVD present by a factor of 1.0153.
Every unit increase in systolic blood pressure will increase the odds-ratio having CVD present by a factor of 1.0404.
Every unit increase in diastolic blood pressure will increase the odds-ratio having CVD present by a factor of 1.0003.
Having above normal cholesterol, compared to normal cholesterol, is increasing the odds-ratio having CVD present by a factor of 1.5272.
Having well above normal cholesterol, compared to normal cholesterol, is increasing the odds-ratio having CVD present by a factor of 2.6792.
Smoking, compared to not smoking, will decrease the odds-ratio having CVD present by a factor of 0.8850. Drinking, compared to not drinking alcohol, will decrease the odds-ratio having CVD present by a factor of 0.8487.
Having activity, compared to not having activity, will decrease the odds-ratio having CVD present by a factor of 0.8118.

5.1.2 Confusion Matrix

loadPkg("regclass")
confusion_matrix(logistic_2)
cfmatrix1 = confusion_matrix(logistic_2)
accuracy1 <- (cfmatrix1[1,1]+cfmatrix1[2,2])/cfmatrix1[3,3]
precision1 <- cfmatrix1[2,2]/(cfmatrix1[2,2]+cfmatrix1[1,2])
recall1 <- cfmatrix1[2,2]/(cfmatrix1[2,2]+cfmatrix1[2,1])
specificity1 <- cfmatrix1[1,1]/(cfmatrix1[1,1]+cfmatrix1[1,2])
F1_score1 <- 2*(precision1)*(recall1)/(precision1 + recall1)
accuracy1
precision1
recall1
specificity1
F1_score1
xkabledply( confusion_matrix(logistic_2), title = "Confusion matrix from Logit Model" )

From confusion matrix, we conclude that:
- Accuracy = 0.722
- Precision = 0.743
- Recall = 0.677
- Specificity = 0.767
- F1 score = 0.708

5.1.3 ROC and AUC

loadPkg("pROC")
prob <- predict(logistic_2, type = "response")
df$prob <- prob
h <- roc(cardio ~ prob, data = df)
auc(h)
## Area under the curve: 0.785
plot(h)

We have the area under the curve of 0.785, this is not a good fit.

5.1.4 Feature Selection

Feature selection based on adjusted R^2.

loadPkg("leaps")
reg.leaps <- regsubsets(cardio ~ ., data = df, nbest = 1, method = "exhaustive")
plot(reg.leaps, scale = "adjr2", main = "Adjusted R^2")

Feature selection based on BIC.

plot(reg.leaps, scale = "bic", main = "BIC")

Feature selection based on Cp.

plot(reg.leaps, scale = "Cp", main = "Cp")

All three methods have similar results.

5.2 KNN

We could use K-nearest-neighbor algorithm to predict whether patient will get CVD, by looking at K neighbors who share similar attributes.

df$prob <- NULL
str(df)
df_num <- df
for(i in 1:12){
  if (!(i %in% c(1,3:6))){
    df_num[,i] = as.numeric(df_num[,i])
  }
}
str(df_num)

5.2.1 Center and Scale

scale1 <- subset(df_num, select = -cardio)
scale1$cardio <- df$cardio
str(scale1)
scale1[c(1,3,4,5,6)] <- as.data.frame(scale(scale1[c(1,3,4,5,6)], center = TRUE, scale = TRUE))
set.seed(1)
df_sample <- sample(2, nrow(scale1), replace = TRUE, prob = c(0.75,0.25))
df_train <- scale1[df_sample==1, 1:11]
df_test <- scale1[df_sample==2, 1:11]
df.trainLabels <- df[df_sample==1, 12]
df.testLabels <- df[df_sample==2, 12]
str(df_train)
## 'data.frame':    52430 obs. of  11 variables:
##  $ age        : num  -0.436 0.308 -0.248 -0.809 1.263 ...
##  $ gender     : num  2 1 1 1 2 1 1 1 2 2 ...
##  $ height     : num  0.443 -1.018 0.078 -1.018 1.661 ...
##  $ weight     : num  -0.848 0.75 -0.709 -1.265 1.445 ...
##  $ ap_hi      : num  -0.12218 0.07261 0.00768 -0.18711 0.00768 ...
##  $ ap_lo      : num  -0.0882 -0.0352 -0.1413 -0.1944 -0.0352 ...
##  $ cholesterol: num  1 3 3 1 3 1 1 1 1 1 ...
##  $ gluc       : num  1 1 1 1 3 1 1 1 1 1 ...
##  $ smoke      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ alco       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ active     : num  2 2 1 1 2 2 1 2 2 1 ...
str(df_test)
## 'data.frame':    17570 obs. of  11 variables:
##  $ age        : num  -0.748 0.991 1.072 -2.001 -1.103 ...
##  $ gender     : num  2 1 1 2 1 2 2 2 2 2 ...
##  $ height     : num  0.565 -1.627 -0.896 2.027 -0.775 ...
##  $ weight     : num  0.542 -0.5 1.306 1.445 -1.542 ...
##  $ ap_hi      : num  0.13754 -0.05725 0.00768 0.00768 -0.12218 ...
##  $ ap_lo      : num  0.0179 -0.0882 -0.0882 -0.0352 -0.1413 ...
##  $ cholesterol: num  1 2 3 1 1 1 1 1 3 1 ...
##  $ gluc       : num  1 2 1 1 3 1 1 1 1 1 ...
##  $ smoke      : num  1 1 1 2 1 2 1 1 1 1 ...
##  $ alco       : num  1 1 1 2 1 1 1 1 1 1 ...
##  $ active     : num  2 1 2 2 2 2 2 2 1 2 ...

5.2.2 Select K

knn_different_k = sapply(seq(1, 21, by = 2),  
                         function(x) chooseK(x, 
                                             train_set = df_train,
                                             val_set = df_test,
                                             train_class = df.trainLabels,
                                             val_class = df.testLabels))
knn_different_k = data.frame(k = knn_different_k[1,],
                             accuracy = knn_different_k[2,])
library("ggplot2")
ggplot(knn_different_k,
       aes(x = k, y = accuracy)) +
  geom_line(color = "orange", size = 1.5) +
  geom_point(size = 3) + 
  labs(title = "accuracy vs k")

xkabledply((knn_different_k))
Model: k ~ accuracy
k accuracy
1 0.610
3 0.635
5 0.644
7 0.651
9 0.658
11 0.659
13 0.660
15 0.660
17 0.665
19 0.662
21 0.665

k=17 is a decent choice. After k=17, the accuracy start to going up and down.

5.2.3 Evaluation

pred <- knn(train = df_train, test = df_test, cl=df.trainLabels, k=17)
pred
loadPkg("gmodels")
churnPredCross <- CrossTable(df.testLabels, pred, prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  17570 
## 
##  
##               | pred 
## df.testLabels |        No |       Yes | Row Total | 
## --------------|-----------|-----------|-----------|
##            No |      6040 |      2641 |      8681 | 
##               |     0.696 |     0.304 |     0.494 | 
##               |     0.651 |     0.319 |           | 
##               |     0.344 |     0.150 |           | 
## --------------|-----------|-----------|-----------|
##           Yes |      3245 |      5644 |      8889 | 
##               |     0.365 |     0.635 |     0.506 | 
##               |     0.349 |     0.681 |           | 
##               |     0.185 |     0.321 |           | 
## --------------|-----------|-----------|-----------|
##  Column Total |      9285 |      8285 |     17570 | 
##               |     0.528 |     0.472 |           | 
## --------------|-----------|-----------|-----------|
## 
## 

Over the 17570 observations, the model correctly predicted that 6040 patients do not have CVD, and correctly predicted that 5643 patients have CVD.
Accuracy = 0.6648
Precision = 0.6809
Recall = 0.6348
Specificity = 0.6954
F1_score = 0.657

5.3 KNN with selected variables

Based on previous EDA and regression model, we removed gender and gluc.

df_num2 <- subset(df_num, select = -c(gender, gluc, prob))
str(df_num2)
scale2 <- df_num2
scale2[1:5] <- scale(df_num2[1:5], center = TRUE, scale = TRUE)
scale2$cardio <- df$cardio
str(scale2)

5.3.1 Center and Scale

set.seed(1)
df_sample2 <- sample(2, nrow(scale2), replace = TRUE, prob = c(0.75,0.25))
df_train2 <- scale2[df_sample2==1, 1:9]
df_test2 <- scale2[df_sample2==2, 1:9]
df.trainLabel2 <- scale2[df_sample2==1, 10]
df.testLabel2 <- scale2[df_sample2==2, 10]
str(df_train2)
## 'data.frame':    52430 obs. of  9 variables:
##  $ age        : num  -0.436 0.308 -0.248 -0.809 1.263 ...
##  $ height     : num  0.443 -1.018 0.078 -1.018 1.661 ...
##  $ weight     : num  -0.848 0.75 -0.709 -1.265 1.445 ...
##  $ ap_hi      : num  -0.12218 0.07261 0.00768 -0.18711 0.00768 ...
##  $ ap_lo      : num  -0.0882 -0.0352 -0.1413 -0.1944 -0.0352 ...
##  $ cholesterol: num  1 3 3 1 3 1 1 1 1 1 ...
##  $ smoke      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ alco       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ active     : num  2 2 1 1 2 2 1 2 2 1 ...
str(df_test2)
## 'data.frame':    17570 obs. of  9 variables:
##  $ age        : num  -0.748 0.991 1.072 -2.001 -1.103 ...
##  $ height     : num  0.565 -1.627 -0.896 2.027 -0.775 ...
##  $ weight     : num  0.542 -0.5 1.306 1.445 -1.542 ...
##  $ ap_hi      : num  0.13754 -0.05725 0.00768 0.00768 -0.12218 ...
##  $ ap_lo      : num  0.0179 -0.0882 -0.0882 -0.0352 -0.1413 ...
##  $ cholesterol: num  1 2 3 1 1 1 1 1 3 1 ...
##  $ smoke      : num  1 1 1 2 1 2 1 1 1 1 ...
##  $ alco       : num  1 1 1 2 1 1 1 1 1 1 ...
##  $ active     : num  2 1 2 2 2 2 2 2 1 2 ...

5.3.2 Select K

knn_different_k2 = sapply(seq(1, 21, by = 2),  
                         function(x) chooseK(x, 
                                             train_set = df_train2,
                                             val_set = df_test2,
                                             train_class = df.trainLabel2,
                                             val_class = df.testLabel2))
str(knn_different_k2)
##  num [1:2, 1:11] 1 0.612 3 0.642 5 ...
knn_different_k2 = data.frame(k = knn_different_k2[1,],
                             accuracy = knn_different_k2[2,])
library("ggplot2")
ggplot(knn_different_k2,
       aes(x = k, y = accuracy)) +
  geom_line(color = "orange", size = 1.5) +
  geom_point(size = 3) + 
  labs(title = "accuracy vs k")

xkabledply((knn_different_k2))
Model: k ~ accuracy
k accuracy
1 0.612
3 0.642
5 0.654
7 0.660
9 0.663
11 0.668
13 0.672
15 0.670
17 0.670
19 0.671
21 0.670

We should select value k=13 here as it has the highest accuracy.

5.3.3 Evaluation

pred2 <- knn(train = df_train2, test = df_test2, cl = df.trainLabel2, k = 13)
knn.roc.prob <- attr(knn(train = df_train2, test = df_test2, cl = df.trainLabel2, k = 13, prob = T), 'prob')
pred2

Confusion Matrix

churnPredCross2 <- CrossTable(df.testLabel2, pred2, prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  17570 
## 
##  
##               | pred2 
## df.testLabel2 |        No |       Yes | Row Total | 
## --------------|-----------|-----------|-----------|
##            No |      6085 |      2596 |      8681 | 
##               |     0.701 |     0.299 |     0.494 | 
##               |     0.657 |     0.312 |           | 
##               |     0.346 |     0.148 |           | 
## --------------|-----------|-----------|-----------|
##           Yes |      3172 |      5717 |      8889 | 
##               |     0.357 |     0.643 |     0.506 | 
##               |     0.343 |     0.688 |           | 
##               |     0.181 |     0.325 |           | 
## --------------|-----------|-----------|-----------|
##  Column Total |      9257 |      8313 |     17570 | 
##               |     0.527 |     0.473 |           | 
## --------------|-----------|-----------|-----------|
## 
## 

Accuracy = 0.6717 Precision = 0.6877 Recall = 0.6432 Specificity = 0.701 F1_score = 0.6616
All the statistic have improved a little bit comparing to the previous KNN model.

AUC and ROC

k13 <- knn(train = df_train2, test = df_test2, cl = df.trainLabel2, k = 13, prob = TRUE)
prob = attr(k13, "prob")
prob <- 2*ifelse(k13 == "-1", 1-prob, prob) - 1
loadPkg("pROC")
roc(df.testLabel2, prob)
plot(roc(df.testLabel2, prob), print.thres = T, print.auc = T, )

5.3.4 Comparison

tab <- matrix(c(0.6583, 0.6733, 0.6308, 0.6866, 0.6514, 0.6717, 0.6877, 0.6432, 0.701, 0.6616, 0.722, 0.743, 0.677, 0.767, 0.708), ncol = 5, byrow = TRUE)
colnames(tab) <- c("Accuracy", "Precision", "Recall", "Specificity", "F1 Score")
rownames(tab) <- c("KNN with all variables", "KNN with selected variables", "Logistic with selected variables")
tab <- as.table(tab)
xkabledply(tab, "Models Comparison")
Models Comparison
Accuracy Precision Recall Specificity F1 Score
KNN with all variables 0.658 0.673 0.631 0.687 0.651
KNN with selected variables 0.672 0.688 0.643 0.701 0.662
Logistic with selected variables 0.722 0.743 0.677 0.767 0.708

Overall, logistic regressio has the best performance among three models.

5.4 Classification Tree

5.4.1 Feature Selection and Depth Selection

First, we change all the variables to numeric variables except cardio.

str(df)
df_num$cardio <- factor(CVD$cardio)
levels(df_num$cardio) <- c("No", "Yes")
str(df_num)

Then, we use randomForest and MeanDecreaseGini to create a list of feature importance of all the variables.

library(randomForest)
fit_im = randomForest(df_num$cardio~., data=df_num)
importance(fit_im)
##             MeanDecreaseGini
## age                     5947
## gender                   401
## height                  3053
## weight                  3477
## ap_hi                   5822
## ap_lo                   2847
## cholesterol             1273
## gluc                     485
## smoke                    247
## alco                     214
## active                   352
varImpPlot(fit_im)

Based on the MeanDecreaseGini, we select age, weight, diastolic blood pressure, height and cholesterol as our variables for classification tree model. We excluded systolic blood pressure due to its high correlation with diastolic blood pressure.

Models Summary with varying depths.

loadPkg("rpart")
loadPkg("caret")



# create an empty dataframe to store the results from confusion matrices
confusionMatrixResultDf = data.frame(Depth = numeric(0), Accuracy  = numeric(0), Sensitivity = numeric(0), Specificity = numeric(0), 
                                      Pos.Pred.Value = numeric(0), Neg.Pred.Value = numeric(0), Precision = numeric(0), Recall = numeric(0),
                                      F1 = numeric(0), Prevalence = numeric(0), Detection.Rate = numeric(0), Detection.Prevalence = numeric(0),
                                      Balanced.Accurary = numeric(0), row.names = NULL)

for (deep in 2:15) {
  kfit <- rpart(cardio ~ age + weight + ap_lo + height + cholesterol, data = df_num, method="class", control = list(maxdepth = deep) )
  # 
  cm = confusionMatrix( predict(kfit, type = "class"), reference = df_num[, "cardio"] ) # from caret library
  # 
  cmaccu = cm$overall['Accuracy']
  # print( paste("Total Accuracy = ", cmaccu ) )
  # 
  cmt = data.frame(Depth=deep, Accuracy = cmaccu, row.names = NULL ) # initialize a row of the metrics 
  cmt = cbind( cmt, data.frame( t(cm$byClass) ) ) # the dataframe of the transpose, with k valued added in front
  confusionMatrixResultDf = rbind(confusionMatrixResultDf, cmt)
  # print("Other metrics : ")
}

unloadPkg("caret")
xkabledply(confusionMatrixResultDf, title = "Cardio Classificantion Trees summary with varying MaxDepth")
Cardio Classificantion Trees summary with varying MaxDepth
Depth Accuracy Sensitivity Specificity Pos.Pred.Value Neg.Pred.Value Precision Recall F1 Prevalence Detection.Rate Detection.Prevalence Balanced.Accuracy
2 0.670 0.582 0.757 0.706 0.644 0.706 0.582 0.638 0.5 0.291 0.413 0.670
3 0.689 0.831 0.546 0.647 0.764 0.647 0.831 0.728 0.5 0.416 0.643 0.689
4 0.702 0.771 0.632 0.677 0.734 0.677 0.771 0.721 0.5 0.386 0.570 0.702
5 0.702 0.771 0.632 0.677 0.734 0.677 0.771 0.721 0.5 0.386 0.570 0.702
6 0.702 0.771 0.632 0.677 0.734 0.677 0.771 0.721 0.5 0.386 0.570 0.702
7 0.702 0.771 0.632 0.677 0.734 0.677 0.771 0.721 0.5 0.386 0.570 0.702
8 0.702 0.771 0.632 0.677 0.734 0.677 0.771 0.721 0.5 0.386 0.570 0.702
9 0.702 0.771 0.632 0.677 0.734 0.677 0.771 0.721 0.5 0.386 0.570 0.702
10 0.702 0.771 0.632 0.677 0.734 0.677 0.771 0.721 0.5 0.386 0.570 0.702
11 0.702 0.771 0.632 0.677 0.734 0.677 0.771 0.721 0.5 0.386 0.570 0.702
12 0.702 0.771 0.632 0.677 0.734 0.677 0.771 0.721 0.5 0.386 0.570 0.702
13 0.702 0.771 0.632 0.677 0.734 0.677 0.771 0.721 0.5 0.386 0.570 0.702
14 0.702 0.771 0.632 0.677 0.734 0.677 0.771 0.721 0.5 0.386 0.570 0.702
15 0.702 0.771 0.632 0.677 0.734 0.677 0.771 0.721 0.5 0.386 0.570 0.702

We choose MaxDepth = 4 for the classificant tree.

5.4.2 Result

set.seed(1000)
cardiofit <- rpart(cardio ~ age + weight + ap_lo + height + cholesterol, data = df_num, method = "class", control = list(maxdepth=4))
printcp(cardiofit)
## 
## Classification tree:
## rpart(formula = cardio ~ age + weight + ap_lo + height + cholesterol, 
##     data = df_num, method = "class", control = list(maxdepth = 4))
## 
## Variables actually used in tree construction:
## [1] age         ap_lo       cholesterol
## 
## Root node error: 34979/70000 = 0.5
## 
## n= 70000 
## 
##     CP nsplit rel error xerror  xstd
## 1 0.32      0       1.0    1.0 0.004
## 2 0.02      1       0.7    0.7 0.004
## 3 0.01      4       0.6    0.6 0.003
## 4 0.01      5       0.6    0.6 0.003
plotcp(cardiofit)

summary(cardiofit)
## Call:
## rpart(formula = cardio ~ age + weight + ap_lo + height + cholesterol, 
##     data = df_num, method = "class", control = list(maxdepth = 4))
##   n= 70000 
## 
##       CP nsplit rel error xerror    xstd
## 1 0.3223      0     1.000  1.005 0.00378
## 2 0.0220      1     0.678  0.678 0.00358
## 3 0.0149      4     0.612  0.612 0.00349
## 4 0.0100      5     0.597  0.597 0.00346
## 
## Variable importance
##       ap_lo         age cholesterol      weight 
##          66          20          13           1 
## 
## Node number 1: 70000 observations,    complexity param=0.322
##   predicted class=No   expected loss=0.5  P(node) =1
##     class counts: 35021 34979
##    probabilities: 0.500 0.500 
##   left son=2 (49369 obs) right son=3 (20631 obs)
##   Primary splits:
##       ap_lo       < 85.5 to the left,  improve=4380.0, (0 missing)
##       cholesterol < 1.5  to the left,  improve=1480.0, (0 missing)
##       age         < 55   to the left,  improve=1350.0, (0 missing)
##       weight      < 75.5 to the left,  improve= 834.0, (0 missing)
##       height      < 156  to the right, improve=  16.1, (0 missing)
##   Surrogate splits:
##       weight < 98.5 to the left,  agree=0.708, adj=0.01, (0 split)
##       height < 58   to the right, agree=0.705, adj=0.00, (0 split)
## 
## Node number 2: 49369 observations,    complexity param=0.022
##   predicted class=No   expected loss=0.385  P(node) =0.705
##     class counts: 30343 19026
##    probabilities: 0.615 0.385 
##   left son=4 (28886 obs) right son=5 (20483 obs)
##   Primary splits:
##       age         < 55.1 to the left,  improve=1160.00, (0 missing)
##       cholesterol < 2.5  to the left,  improve=1070.00, (0 missing)
##       ap_lo       < 73.5 to the left,  improve= 422.00, (0 missing)
##       weight      < 79.5 to the left,  improve= 326.00, (0 missing)
##       height      < 156  to the right, improve=   6.55, (0 missing)
##   Surrogate splits:
##       cholesterol < 2.5  to the left,  agree=0.609, adj=0.059, (0 split)
##       height      < 150  to the right, agree=0.587, adj=0.005, (0 split)
##       weight      < 39.5 to the right, agree=0.585, adj=0.000, (0 split)
##       ap_lo       < 80.5 to the left,  agree=0.585, adj=0.000, (0 split)
## 
## Node number 3: 20631 observations
##   predicted class=Yes  expected loss=0.227  P(node) =0.295
##     class counts:  4678 15953
##    probabilities: 0.227 0.773 
## 
## Node number 4: 28886 observations,    complexity param=0.0149
##   predicted class=No   expected loss=0.294  P(node) =0.413
##     class counts: 20390  8496
##    probabilities: 0.706 0.294 
##   left son=8 (27295 obs) right son=9 (1591 obs)
##   Primary splits:
##       cholesterol < 2.5  to the left,  improve=460.00, (0 missing)
##       age         < 43.8 to the left,  improve=228.00, (0 missing)
##       ap_lo       < 77.5 to the left,  improve=221.00, (0 missing)
##       weight      < 79.5 to the left,  improve=192.00, (0 missing)
##       height      < 164  to the left,  improve=  5.15, (0 missing)
## 
## Node number 5: 20483 observations,    complexity param=0.022
##   predicted class=Yes  expected loss=0.486  P(node) =0.293
##     class counts:  9953 10530
##    probabilities: 0.486 0.514 
##   left son=10 (17692 obs) right son=11 (2791 obs)
##   Primary splits:
##       cholesterol < 2.5  to the left,  improve=357.00, (0 missing)
##       age         < 61.1 to the left,  improve=225.00, (0 missing)
##       weight      < 72.5 to the left,  improve= 91.10, (0 missing)
##       ap_lo       < 78.5 to the left,  improve= 79.90, (0 missing)
##       height      < 156  to the right, improve=  3.61, (0 missing)
## 
## Node number 8: 27295 observations
##   predicted class=No   expected loss=0.273  P(node) =0.39
##     class counts: 19855  7440
##    probabilities: 0.727 0.273 
## 
## Node number 9: 1591 observations
##   predicted class=Yes  expected loss=0.336  P(node) =0.0227
##     class counts:   535  1056
##    probabilities: 0.336 0.664 
## 
## Node number 10: 17692 observations,    complexity param=0.022
##   predicted class=No   expected loss=0.477  P(node) =0.253
##     class counts:  9253  8439
##    probabilities: 0.523 0.477 
##   left son=20 (12583 obs) right son=21 (5109 obs)
##   Primary splits:
##       age         < 60.7 to the left,  improve=183.00, (0 missing)
##       ap_lo       < 72.5 to the left,  improve= 66.70, (0 missing)
##       weight      < 79.5 to the left,  improve= 50.90, (0 missing)
##       cholesterol < 1.5  to the left,  improve= 22.30, (0 missing)
##       height      < 182  to the right, improve=  1.36, (0 missing)
##   Surrogate splits:
##       ap_lo < 0.5  to the right, agree=0.711, adj=0, (0 split)
## 
## Node number 11: 2791 observations
##   predicted class=Yes  expected loss=0.251  P(node) =0.0399
##     class counts:   700  2091
##    probabilities: 0.251 0.749 
## 
## Node number 20: 12583 observations
##   predicted class=No   expected loss=0.431  P(node) =0.18
##     class counts:  7158  5425
##    probabilities: 0.569 0.431 
## 
## Node number 21: 5109 observations
##   predicted class=Yes  expected loss=0.41  P(node) =0.073
##     class counts:  2095  3014
##    probabilities: 0.410 0.590

There is our classification tree:

plot(cardiofit, uniform = TRUE, main = "Classification Tree for Cardio")
text(cardiofit, use.n = TRUE, all = TRUE, cex = 1)

Same tree with different looking:

loadPkg("rpart.plot")
loadPkg("rattle")
rpart.plot(cardiofit)

fancyRpartPlot(cardiofit)

5.4.3 Evaluation

Confusion Matrix

loadPkg("caret")
cm = confusionMatrix( predict (cardiofit, type = "class"), reference = df_num[, "cardio"])
print('Overall: ')
## [1] "Overall: "
cm$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##       7.02e-01       4.04e-01       6.98e-01       7.05e-01       5.00e-01 
## AccuracyPValue  McnemarPValue 
##       0.00e+00      1.14e-247
print('Class')
## [1] "Class"
cm$byClass
##          Sensitivity          Specificity       Pos Pred Value 
##                0.771                0.632                0.677 
##       Neg Pred Value            Precision               Recall 
##                0.734                0.677                0.771 
##                   F1           Prevalence       Detection Rate 
##                0.721                0.500                0.386 
## Detection Prevalence    Balanced Accuracy 
##                0.570                0.702
xkabledply(cm$table, "confusion matrix")
confusion matrix
No Yes
No 27013 12865
Yes 8008 22114

AUC and ROC

library(rpart)
rp <- rpart(cardio ~., data = df_num)
library(ROCR)
pred <- prediction(predict(cardiofit, type = "prob")[,2], df_num$cardio)
tree.predict.prob <- predict(cardiofit, type = "prob")[,2]
plot(performance(pred, "tpr", "fpr"), main = "ROC Cardio")

auc = performance(pred, 'auc')
slot(auc, 'y.values')
## [[1]]
## [1] 0.735

5.5 Comparison

tab1 <- matrix(c(0.672, 0.688, 0.643, 0.701, 0.662, 0.722, 0.743, 0.677, 0.767, 0.708, 0.702, 0.677, 0.771, 0.632, 0.721), ncol = 5, byrow = TRUE)
colnames(tab1) <- c("Accuracy", "Precision", "Recall", "Specificity", "F1 Score")
rownames(tab1) <- c("KNN with selected variables", "Logistic with selected variables", "Classification Tree with selected variables")
tab1 <- as.table(tab1)
xkabledply(tab1, "Model Comparison")
Model Comparison
Accuracy Precision Recall Specificity F1 Score
KNN with selected variables 0.672 0.688 0.643 0.701 0.662
Logistic with selected variables 0.722 0.743 0.677 0.767 0.708
Classification Tree with selected variables 0.702 0.677 0.771 0.632 0.721

Conclusion

We conducted a research about cardiovascular diseases. The main question was “does healthy lifestyle work?”.

After analyzing the dataset and the dataframe, we came up with the following conclusions.As we said we are looking through a set of data showing the factors impacting the cardio vascular disease. Those factors are smoking, alcohol intake, physical activity, age, weight, height, blood pressure and cholesterol.

According to the EDA analysis, the CVD is present in those who do not smoke, do not drink and do not workout. According to chisquare we found out that smoke, alcohol intake and physical activity have an impact on CVD According to the KDE plot, old people and weight close to 100 have CVD. According to spearman correlation, cholesterol doesnt have an impact on CVD

We also conducted logistic regression, KNN and classification Tree; The logistic regression showed that age, weight, blood pressure and cholesterol have an impact on CVD. The KNN showed that age and weight influence CVD. The classification tree showed that age, weight, height, blood pressure and cholesterol have an impact on CVD.

From these conclusions we can get the characteristics of CVD group pf people which is no smoking, no drinking, no physical activity, old people and weight close to 100. This means that a healthy lifestyle does not really work in this case.

Combined with the above analysis, CVD is mainly due to age and weight.

Some suggestions for the high level of CVD group will be to smoke, drink,workout and lose weight. This means a person can smoke and drink but has to do some physical activity to loose weight.